About the project

I am feeling curious, excited and a bit nervous as well. I expect to learn programming and data analysis in the R language; so far I have been using SPSS. I also want to learn how to analyse big data and find patterns in data that could lead to new findings. I found the course through UEF WebOodi.

R markdown

Formatting text

I have written this portion where you can find
  1. text written in different formats
  2. clickable links

Here is the link to my GitHub repository: https://github.com/saranglq/IODS-project

Here is the link to my diary page https://saranglq.github.io/IODS-project/



RStudio Exercise 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

I started this week by learning R through the DataCamp platform. I learned how to handle (wrangle) data in R: how to modify existing datasets, merge different datasets, and carry out simple and multiple linear regression analyses. I also learned how to make graphical representations of these analyses.

Dataset

The dataset given to us was from the “International survey of Approaches to Learning”, funded by the Teacher’s Academy funding for Kimmo Vehkalahti. Data was collected during 2013–2015.

The original dataset had 183 observations of 60 variables. Variables with prefixes A and C were based on components of the A and C parts of ASSIST (Approaches and Study Skills Inventory for Students). Variables with prefix D were based on SATS (Survey of Attitudes Toward Statistics). Items from the ASSIST B part were named so that their connection to the relevant dimension (Deep/Surface/Strategic) was maintained.

Data wrangling

The dataset had many variables, and I was supposed to combine the ASSIST B items into three major variables (Deep, Surface, Strategy). I also renamed three other variables (Age, Attitude, Points -> age, attitude, points) and then kept only 7 variables, excluding the rest. I saved the resulting dataset as lrn2014.csv, which has 7 variables and 166 observations.
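This wrangling can be sketched with dplyr roughly as below. The item names (`D03`, `D11`, …) and the data frame name `lrn14` are assumptions for illustration; the real groupings come from the ASSIST key, and the surface and strategic scales are built the same way as the deep scale.

```r
library(dplyr)

# Hypothetical item names -- the real groupings come from the ASSIST key
deep_questions <- c("D03", "D11", "D19", "D27")
# ... (surface and strategic question groups are defined the same way)

# average the related items into one scale variable
deep_columns <- select(lrn14, one_of(deep_questions))
lrn14$deep <- rowMeans(deep_columns)

# rename and keep only the 7 analysis variables
lrn14 <- lrn14 %>%
  mutate(age = Age, attitude = Attitude, points = Points) %>%
  select(gender, age, attitude, deep, stra, surf, points) %>%
  filter(points > 0)   # dropping zero-point observations leaves 166 rows

write.csv(lrn14, "lrn2014.csv", row.names = FALSE)
```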

Data distribution

Gender distribution was uneven: there were 110 female and 56 male subjects.

Distributions of each variable (except gender) are given below:

(Histograms showing the distribution of each variable.)

Following is a summary table for all variables:

Variable   Mean    Median   Min     Max
Age        25.51    2.83    17      55
Attitude   31.43   32.00    14      50
Deep        3.68    3.18     1.58    4.91
Strategy    3.12    3.18     1.25    5.00
Surface     2.78    2.83     1.58    4.33
Points     22.72   23.00     7.00   33.00
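A table like this can be computed column by column with `sapply`. The data frame name `students2014` and lowercase variable names are assumptions here:

```r
num_vars <- c("age", "attitude", "deep", "stra", "surf", "points")

# one row per variable, one column per summary statistic
summary_table <- data.frame(
  Mean   = sapply(students2014[num_vars], mean),
  Median = sapply(students2014[num_vars], median),
  Min    = sapply(students2014[num_vars], min),
  Max    = sapply(students2014[num_vars], max)
)
summary_table
```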

Regression

I first carried out an exploratory analysis to find out which variables correlate best with points:

(Correlation plot matrix of the variables.)

Attitude, strategy and surface had the strongest correlations with points and were therefore chosen as the predictor variables.

I ran a multiple linear regression model and got the following results:

(Summary output of the multiple regression model.)

According to the results, attitude is the only variable that has a significant linear relationship with points (p < 0.001). Each 1-point increase in attitude results in an increase of 0.34 in points.

The non-significant variables were removed and the model was fitted again.

(Summary output of the refitted model.)

According to the final model, attitude still has a significant linear relationship with points; each 1-point increase in attitude results in an increase of 0.35 in points. The multiple R-squared value is 0.19, which means that attitude explains 19 percent of the variance in points.
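The final model can be fitted in one line. The data frame name `students2014` is an assumption for illustration:

```r
# final model: exam points explained by attitude alone
final_model <- lm(points ~ attitude, data = students2014)
summary(final_model)  # slope about 0.35, multiple R-squared about 0.19
```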

Assumptions of the multiple linear model are that the
  1. errors are normally distributed
  2. errors are not correlated
  3. errors have constant variance
  4. size of a given error does not depend on the explanatory variables

Diagnostic plots were created to check these assumptions.

(Diagnostic plots: Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage.)

The QQ-plot demonstrates that the errors of the model are approximately normally distributed; there are, however, values at the extremities that deviate from the normal line.

The scatterplot of residuals vs the fitted values shows no pattern, i.e. the spread remains largely similar as the fitted values increase. However, there are some residuals that are far from the fitted line. These could potentially be outliers.

The residuals vs leverage plot shows that none of the observations have an unusually high impact on the results.
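The three diagnostic plots discussed above are base-R defaults for an lm object (`final_model` is an assumed object name):

```r
# base-R diagnostics for an lm fit:
# 1 = residuals vs fitted, 2 = normal Q-Q, 5 = residuals vs leverage
par(mfrow = c(1, 3))
plot(final_model, which = c(1, 2, 5))
```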


RStudio Exercise 3: Logistic regression

The current dataset is called “Student Performance Data Set” and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Student+Performance. The data compares student performance in two subjects: Portuguese and Mathematics. G1 refers to grades received during the first period, and similarly G2 and G3 correspond to grades received in the second and third periods.

library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)

After data wrangling the dataset was saved as alc.csv, which contains 35 variables and 382 observations.

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
## [1] 382  35
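In outline, the derived alcohol variables were computed as below. The join of the Mathematics and Portuguese files and the combining of duplicated answer columns are omitted here:

```r
library(dplyr)

# after joining the two per-subject files and combining duplicated answers:
alc <- alc %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # mean of weekday and weekend use
         high_use = alc_use > 2)        # TRUE flags heavy users
```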

Hypotheses

I would be interested in examining the relationship between alcohol use and:

  1. Parental cohabitation: I think that tense family relationships can lead to higher stress in teenagers and therefore a tendency towards higher alcohol consumption.

  2. Mother’s education: Research shows that higher education for women results in multiple positive outcomes for the family and society. This could also be reflected in the attitudes of children.

  3. Romantic relationships: Failure to bond during teenage years and peer pressure to be in a relationship can result in higher tendency to drink.

  4. Current health status: It might be a causal factor for drinking, but also vice versa, i.e. excessive drinking could be related to poor health.

Distribution of variables

Parental cohabitation and romantic relationship are binary variables. Mother’s education and health are categorical, ordinal variables. Bar charts showing their distributions are presented below.

The charts show the distribution of subjects based on their relationship status, parental cohabitation and alcohol use.

## # A tibble: 4 x 3
## # Groups:   Pstatus [2]
##   Pstatus high_use count
##   <fct>   <lgl>    <int>
## 1 A       FALSE       26
## 2 A       TRUE        12
## 3 T       FALSE      242
## 4 T       TRUE       102
## # A tibble: 4 x 3
## # Groups:   romantic [2]
##   romantic high_use count
##   <fct>    <lgl>    <int>
## 1 no       FALSE      180
## 2 no       TRUE        81
## 3 yes      FALSE       88
## 4 yes      TRUE        33
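Counts like the tibbles above come from a grouped summarise, e.g.:

```r
library(dplyr)

# count subjects in each combination of cohabitation status and alcohol use
alc %>%
  group_by(Pstatus, high_use) %>%
  summarise(count = n())
```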

Logistic regression

None of the chosen variables are significantly associated with high alcohol use. The model summary is shown below, followed by the odds ratios for high alcohol use and their confidence intervals.

## 
## Call:
## glm(formula = high_use ~ Pstatus + Medu + romantic + health, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4627  -0.8821  -0.7091   1.2911   2.0228  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.10674    1.32994   0.080    0.936
## PstatusT    -0.08875    0.38587  -0.230    0.818
## Medu1       -0.91856    1.27172  -0.722    0.470
## Medu2       -1.90163    1.26062  -1.508    0.131
## Medu3       -0.91270    1.25197  -0.729    0.466
## Medu4       -1.27811    1.24739  -1.025    0.306
## romanticyes -0.15946    0.25057  -0.636    0.525
## health2      0.76463    0.48160   1.588    0.112
## health3      0.13564    0.44268   0.306    0.759
## health4      0.32219    0.45049   0.715    0.474
## health5      0.63142    0.39548   1.597    0.110
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 448.12  on 371  degrees of freedom
## AIC: 470.12
## 
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
##                    OR       2.5 %    97.5 %
## (Intercept) 1.1126471 0.086924876 27.053858
## PstatusT    0.9150754 0.437244590  2.007650
## Medu1       0.3990923 0.017558764  4.548870
## Medu2       0.1493248 0.006654147  1.665484
## Medu3       0.4014386 0.018078594  4.410220
## Medu4       0.2785625 0.012611116  3.033450
## romanticyes 0.8526065 0.517603874  1.385456
## health2     2.1482059 0.843454699  5.637405
## health3     1.1452705 0.486362057  2.792529
## health4     1.3801401 0.576722981  3.413775
## health5     1.8802695 0.888022503  4.233663
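The summary and odds-ratio table above can be reproduced along these lines (the object name `m` is illustrative):

```r
# logistic regression of high alcohol use on the four chosen predictors
m <- glm(high_use ~ Pstatus + Medu + romantic + health,
         data = alc, family = "binomial")
summary(m)

# odds ratios with profile-likelihood confidence intervals
OR <- exp(coef(m))
CI <- exp(confint(m))
cbind(OR, CI)
```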

Choosing alternative variables

I chose study time, activities and going out with friends. Distributions of these variables are presented through bar charts.

Based on the distributions, I chose to explore gender, study time, activities and going out. Gender and going out were significantly associated with high alcohol use. Study time also had a significant but weaker association.

Males were more likely to drink heavily (OR 2.24, 95% CI 1.32–3.85). Compared to the reference group, students who went out most frequently were more likely to consume alcohol heavily (OR 10.74, 95% CI 3.04–51.42).

## 
## Call:
## glm(formula = high_use ~ sex + studytime + activities + goout, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7617  -0.7725  -0.5228   0.8228   2.5617  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.8672     0.6722  -2.778 0.005474 ** 
## sexM            0.8073     0.2721   2.966 0.003012 ** 
## studytime2     -0.3456     0.2970  -1.164 0.244524    
## studytime3     -0.9712     0.4809  -2.019 0.043455 *  
## studytime4     -1.1241     0.6298  -1.785 0.074310 .  
## activitiesyes  -0.4044     0.2590  -1.561 0.118454    
## goout2          0.3503     0.6956   0.504 0.614510    
## goout3          0.5243     0.6813   0.770 0.441507    
## goout4          2.0665     0.6816   3.032 0.002430 ** 
## goout5          2.3735     0.7020   3.381 0.000722 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 384.35  on 372  degrees of freedom
## AIC: 404.35
## 
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
##                       OR      2.5 %     97.5 %
## (Intercept)    0.1545556 0.03368531  0.5095379
## sexM           2.2418487 1.32119716  3.8496160
## studytime2     0.7077842 0.39492190  1.2688956
## studytime3     0.3786406 0.14036906  0.9407166
## studytime4     0.3249548 0.08308952  1.0313723
## activitiesyes  0.6673741 0.39987596  1.1063789
## goout2         1.4195109 0.40277427  6.7023293
## goout3         1.6893594 0.49743864  7.8225116
## goout4         7.8972057 2.33914417 36.6844868
## goout5        10.7354006 3.03673066 51.4240176

Predictions using the model

Previously significant variables (sex, study time, going out) were chosen for this model. The cross-tabulation below shows the proportions of subjects classified correctly and incorrectly by the model as high users or non-high users.

##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65183246 0.04973822 0.70157068
##    TRUE  0.15445026 0.14397906 0.29842932
##    Sum   0.80628272 0.19371728 1.00000000
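The proportion table is produced by tabulating the model's predictions against the truth. The model object `model` and the 0.5 cutoff are assumptions here:

```r
# predicted probability of being a high user, then a hard classification
alc$probability <- predict(model, newdata = alc, type = "response")
alc$prediction  <- alc$probability > 0.5

table(high_use = alc$high_use, prediction = alc$prediction) %>%
  prop.table() %>%
  addmargins()
```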

The loss function (training error) of the model was:

## [1] 0.2041885
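The loss here is the share of observations whose predicted probability falls on the wrong side of 0.5; a sketch, assuming the `probability` column from the prediction step:

```r
loss_func <- function(class, prob) {
  # proportion of cases where the prediction disagrees with the truth
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
```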

If a simple guessing strategy were employed, there would be a 50 percent chance of a correct answer, since the outcome is binary. The model, however, is more effective: it gives a correct result about 80 percent of the time, as demonstrated on the training set.

Bonus exercise

## [1] 0.2303665

The 10-fold cross-validation resulted in a 0.23 error for my model, which is less than the 0.26 error of the model introduced in DataCamp.
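The cross-validation uses `boot::cv.glm` with the same loss as its cost function (the model object `model` is illustrative):

```r
library(boot)

# same loss as above, redefined here as the cost function
cost_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation; delta[1] is the average prediction error
cv <- cv.glm(data = alc, cost = cost_func, glmfit = model, K = 10)
cv$delta[1]
```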

Super bonus

Let us fit all the variables into a logistic regression model.

##  [1] "school"      "sex"         "age"         "address"     "famsize"    
##  [6] "Pstatus"     "Medu"        "Fedu"        "Mjob"        "Fjob"       
## [11] "reason"      "nursery"     "internet"    "guardian"    "traveltime" 
## [16] "studytime"   "failures"    "schoolsup"   "famsup"      "paid"       
## [21] "activities"  "higher"      "romantic"    "famrel"      "freetime"   
## [26] "goout"       "Dalc"        "Walc"        "health"      "absences"   
## [31] "G1"          "G2"          "G3"          "alc_use"     "high_use"   
## [36] "probability" "prediction"
## 
## Call:
## glm(formula = high_use ~ school + sex + studytime + goout + Pstatus + 
##     Medu + Fedu + Mjob + Fjob + reason + nursery + internet + 
##     guardian + traveltime + studytime + failures + schoolsup + 
##     famsup + paid + activities + higher + romantic + famrel + 
##     freetime + goout + health + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9372  -0.6024  -0.2891   0.3696   2.9525  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -16.00788  998.11272  -0.016 0.987204    
## schoolMS          -0.31881    0.56801  -0.561 0.574610    
## sexM               1.14045    0.36391   3.134 0.001725 ** 
## studytime2        -0.10816    0.39602  -0.273 0.784764    
## studytime3        -0.58458    0.62948  -0.929 0.353053    
## studytime4        -0.51937    0.81835  -0.635 0.525648    
## goout2             0.35067    0.92424   0.379 0.704381    
## goout3             0.39244    0.89417   0.439 0.660744    
## goout4             2.48140    0.90743   2.735 0.006247 ** 
## goout5             2.57466    0.94753   2.717 0.006583 ** 
## PstatusT          -0.64279    0.57431  -1.119 0.263036    
## Medu1             -1.46770    1.63389  -0.898 0.369033    
## Medu2             -2.29763    1.63061  -1.409 0.158818    
## Medu3             -0.97329    1.65010  -0.590 0.555302    
## Medu4             -1.11258    1.71722  -0.648 0.517053    
## Fedu1             13.12642  998.10935   0.013 0.989507    
## Fedu2             13.35623  998.10933   0.013 0.989323    
## Fedu3             13.14915  998.10933   0.013 0.989489    
## Fedu4             13.46401  998.10938   0.013 0.989237    
## Mjobhealth        -1.49344    0.86167  -1.733 0.083061 .  
## Mjobother         -0.81758    0.57177  -1.430 0.152744    
## Mjobservices      -1.17488    0.64412  -1.824 0.068150 .  
## Mjobteacher       -0.54941    0.80208  -0.685 0.493354    
## Fjobhealth         0.51742    1.31546   0.393 0.694071    
## Fjobother          1.19438    0.99984   1.195 0.232255    
## Fjobservices       1.97809    1.02534   1.929 0.053705 .  
## Fjobteacher       -0.58137    1.19604  -0.486 0.626915    
## reasonhome         0.58678    0.43925   1.336 0.181590    
## reasonother        1.59275    0.60414   2.636 0.008379 ** 
## reasonreputation   0.12521    0.45818   0.273 0.784633    
## nurseryyes        -0.38504    0.41007  -0.939 0.347750    
## internetyes       -0.18085    0.47879  -0.378 0.705634    
## guardianmother    -0.65645    0.39486  -1.662 0.096413 .  
## guardianother     -0.70067    0.94726  -0.740 0.459495    
## traveltime2       -0.30249    0.38375  -0.788 0.430558    
## traveltime3        1.91997    0.80682   2.380 0.017328 *  
## traveltime4        3.26053    1.12585   2.896 0.003779 ** 
## failures           0.17111    0.29453   0.581 0.561252    
## schoolsupyes      -0.35779    0.50971  -0.702 0.482709    
## famsupyes         -0.08541    0.35824  -0.238 0.811558    
## paidyes            0.67650    0.35640   1.898 0.057673 .  
## activitiesyes     -0.35070    0.34488  -1.017 0.309210    
## higheryes         -0.11162    0.78940  -0.141 0.887556    
## romanticyes       -0.45218    0.36731  -1.231 0.218300    
## famrel2            0.67295    1.40011   0.481 0.630773    
## famrel3            0.63499    1.27185   0.499 0.617593    
## famrel4           -0.54507    1.24670  -0.437 0.661956    
## famrel5           -1.36245    1.26822  -1.074 0.282688    
## freetime2          1.02423    1.12818   0.908 0.363952    
## freetime3          1.57764    1.09818   1.437 0.150831    
## freetime4          2.01108    1.12377   1.790 0.073521 .  
## freetime5          1.50294    1.19980   1.253 0.210329    
## health2            1.09810    0.72431   1.516 0.129504    
## health3            0.50881    0.68053   0.748 0.454660    
## health4            0.74525    0.68633   1.086 0.277547    
## health5            0.97885    0.61138   1.601 0.109368    
## absences           0.10369    0.03037   3.415 0.000639 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 292.37  on 325  degrees of freedom
## AIC: 406.37
## 
## Number of Fisher Scoring iterations: 14
## [1] 0.1910995
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] 0.2539267

The training error of this model was 0.19 and the testing error was 0.25.

The significant variables were selected and another model was created.

## 
## Call:
## glm(formula = high_use ~ sex + goout + reason + traveltime + 
##     absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7762  -0.7111  -0.4500   0.6284   2.5265  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.31613    0.76071  -4.359 1.31e-05 ***
## sexM              0.97063    0.27099   3.582 0.000341 ***
## goout2            0.49127    0.74900   0.656 0.511890    
## goout3            0.54824    0.73191   0.749 0.453824    
## goout4            2.23550    0.73810   3.029 0.002456 ** 
## goout5            2.35217    0.75767   3.104 0.001906 ** 
## reasonhome        0.36755    0.33529   1.096 0.272983    
## reasonother       1.15017    0.46511   2.473 0.013403 *  
## reasonreputation -0.16018    0.35498  -0.451 0.651819    
## traveltime2      -0.11500    0.30300  -0.380 0.704275    
## traveltime3       1.47622    0.57538   2.566 0.010298 *  
## traveltime4       1.87986    0.92470   2.033 0.042057 *  
## absences          0.08835    0.02377   3.717 0.000202 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 360.59  on 369  degrees of freedom
## AIC: 386.59
## 
## Number of Fisher Scoring iterations: 4
## [1] 0.1884817
## [1] 0.2068063

This model had a better performance on the testing set: the testing error was 0.21. Finally, I decided to remove the reason (for choosing the school) and travel time variables, because they were weak predictors and the category “other” of the reason variable was hard to interpret.

## 
## Call:
## glm(formula = high_use ~ sex + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7601  -0.7278  -0.4717   0.7663   2.2870  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.89554    0.71197  -4.067 4.76e-05 ***
## sexM         1.02291    0.26103   3.919 8.90e-05 ***
## goout2       0.35631    0.73254   0.486 0.626685    
## goout3       0.42736    0.71684   0.596 0.551053    
## goout4       1.99158    0.71679   2.778 0.005461 ** 
## goout5       2.25925    0.73593   3.070 0.002141 ** 
## absences     0.08396    0.02278   3.685 0.000229 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 380.07  on 375  degrees of freedom
## AIC: 394.07
## 
## Number of Fisher Scoring iterations: 4
## [1] 0.2068063
## [1] 0.2146597

The final model had 3 predictors and a testing set loss of 0.21.

##   n_variables training_error
## 1          27      0.1910995
## 2           5      0.1884817
## 3           3      0.2068063
##   n_variables testing_error
## 1          27     0.2539267
## 2           5     0.2068063
## 3           3     0.2146597


RStudio Exercise 4: Clustering and classification

library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)

Now we shall load the Boston dataset from the MASS package and explore it.

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
dim(Boston)
## [1] 506  14

The data has 14 variables and 506 observations. Each observation corresponds to a suburb of Boston, i.e. 506 suburbs in total. The variables are described in detail at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

Overview of data

Data contains 12 numeric variables and 2 integer variables.

The variables are
  1. Per capita crime rate: right-skewed, most values near zero.
  2. Proportion of residential land zoned for lots over 25,000 sq. ft.: right-skewed, most values at 0.
  3. Proportion of industrial zones: uneven, with a pronounced peak around 18%.
  4. Bordering the Charles river: 35 towns border the river.
  5. NOx concentration: right-skewed, from 0.385 to 0.871 parts per 10 million.
  6. Average number of rooms per dwelling: approximately normally distributed, with a peak at 6 to 7.
  7. Proportion of owner-occupied units built prior to 1940: left-skewed, most values near 100%.
  8. Weighted mean of distances to 5 Boston employment centres: right-skewed, range from 1.13 to 12.13.
  9. Index of accessibility to radial highways: range 1 to 24, uneven distribution with a large group at 24.
  10. Property tax rate per $10,000: uneven distribution.
  11. Pupil to teacher ratio: 12.6 to 22, left-skewed.
  12. Proportion (modified) of the Black population by town: left-skewed, range 0.32 to 396.9.
  13. Lower status of the population (in %): right-skewed, peak around 5%.
  14. Median value of owner-occupied homes in $1000s: approximately normally distributed, peak around 20.
varlabels <- c("per capita crime rate",
               "proportion of residential land zoned",
               "proportion of industrial zones per town",
               "town borders Charles river (N/Y)",
               "NOx concentration, PP10M",
               "average number of rooms per dwelling",
               "proportion of units built pre-1940",
               "weighted mean of distances to 5 Boston employment centres",
               "index of accessibility to radial highways",
               "property tax rate per $10,000",
               "pupil to teacher ratio",
               "proportion of Black population by town",
               "lower status of population",
               "median value of homes in $1000s")


a <- 0

for (var in Boston) {
  a <- a + 1
  if (is.integer(var)) {
    # bar chart for the integer-valued variables (chas, rad)
    p <- ggplot(Boston, aes(var)) +
      geom_bar() +
      xlab(varlabels[a]) +
      ylab("Number of suburbs")
    print(p)
  } else {
    # histogram; binwidth approximates Scott's rule 3.49 * sd / n^(1/3),
    # where 506^(1/3) is roughly 7.9
    p <- qplot(var,
               geom = "histogram",
               binwidth = (1 / 7.9) * sd(var) * 3.49,
               main = paste("Histogram for", varlabels[a]),
               xlab = varlabels[a],
               fill = I("blue"),
               col = I("red"),
               alpha = I(.2))
    print(p)
  }
}

cor_matrix<-cor(Boston) 

corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The correlation matrix shows that the strongest positive correlations are between 1. accessibility to highways and property tax rate, 2. industrialization and NOx concentration and 3. number of rooms per dwelling and median value of homes. The strongest negative correlations are between 1. proportion of buildings built prior to 1940 and distance to employment centres, 2. NOx concentration and distance to employment centres, 3. proportion of industrial zones and distance to employment centres and 4. percent lower status of the population and median value of owner-occupied homes.

Many patterns can already be seen in this data. Accessibility to highways indicates higher development, which could mean higher property tax rates. More industrial zones can explain higher NOx concentrations. The number of rooms per dwelling could explain the median value of homes, as larger homes are more expensive. The negative relationship between the percentage of lower-status population and the median value of homes shows that suburbs with more expensive homes have wealthier residents.

Standardizing the data

boston_scaled <- scale(Boston)

summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)

bins <- quantile(boston_scaled$crim)

label <- c("low", "med_low", "med_high", "high")

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = label)

boston_scaled <- dplyr::select(boston_scaled, -crim)

boston_scaled <- data.frame(boston_scaled, crime)

After standardizing the data, the mean of every variable is exactly zero and the ranges have decreased, while the proportions between the variables remain similar.
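scale() subtracts each column's mean and divides by its standard deviation, which can be verified by hand:

```r
x <- Boston$crim
manual <- (x - mean(x)) / sd(x)
all.equal(manual, as.numeric(scale(x)))  # TRUE
```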

Now we shall split the data into train and test set.

n <- nrow(boston_scaled)

ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]

test <- boston_scaled[-ind,]

dim(train)
## [1] 404  14
dim(test)
## [1] 102  14

Linear discriminant analysis

Drawing the LDA biplot. The categorical crime rate variable was used as the target and all other variables were used as predictors.

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes)

lda.arrows(lda.fit, myscale = 3)

Testing the model on test dataset

correct_classes <- test$crime

test <- dplyr::select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)

table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
##           predicted
## correct    low med_low med_high high Sum
##   low       13      18        1    0  32
##   med_low    0      21        5    0  26
##   med_high   1       4       14    1  20
##   high       0       0        0   24  24
##   Sum       14      43       20   25 102

When testing the trained model on the test dataset, the model correctly predicted 13 out of 32 low crime, 21 out of 26 med_low crime, 14 out of 20 med_high crime and 24 out of 24 high crime suburbs.
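The overall accuracy can be computed directly from the same objects:

```r
# share of test suburbs whose crime class was predicted correctly;
# from the table above this is (13 + 21 + 14 + 24) / 102, about 0.71
mean(correct_classes == lda.pred$class)
```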

K-means clustering

data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
dist_eu <- dist(boston_scaled, method = "euclidean")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
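As a minimal sketch of what dist() is computing (toy points, not the Boston data), the Euclidean method returns the familiar square root of summed squared coordinate differences:

```r
# Two points in 2D: (0, 0) and (3, 4); their Euclidean distance is 5
x <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
manual <- sqrt(sum((x[1, ] - x[2, ])^2))
d <- as.numeric(dist(x, method = "euclidean"))
c(manual = manual, dist = d)  # both equal 5
```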
km <- kmeans(boston_scaled, centers = 3)

pairs(boston_scaled[6:10], col = km$cluster)

Checking the optimum number of clusters and re-analyzing

set.seed(123)

k_max <- 10

twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

qplot(x = 1:k_max, y = twcss, geom = 'line')

km <- kmeans(boston_scaled, centers = 4)

pairs(boston_scaled[6:10], col = km$cluster)

I could not find a clear drop in WCSS after two clusters, so I re-ran the analysis with four clusters.

The pairs plot shows that the neighborhoods cannot be separated based on any two variables alone. Some variable pairs separate two clusters well, but almost no pair of variables separates three or more clusters cleanly.
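The elbow heuristic used above can be illustrated on toy data (a hedged sketch with random points, not the Boston set): total WCSS necessarily decreases as k grows, which is why we look for the point where the drop levels off rather than for the minimum.

```r
# TWCSS shrinks as k grows: more centers can only reduce within-cluster
# scatter, so the "best" k is where the curve bends, not where it bottoms out
set.seed(123)
x <- matrix(rnorm(200), ncol = 2)
twcss_toy <- sapply(1:5, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
all(diff(twcss_toy) <= 0)  # monotone decrease
```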

Bonus

So why don’t we try separating the clusters with LDA instead of a pairs plot? I chose 4 clusters for this purpose because it gives a good separation of the data.

data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

set.seed(123)

km <- kmeans(boston_scaled, centers = 4)

lda.fit <- lda(km$cluster ~ ., data = boston_scaled)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(km$cluster)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 7)

According to the LDA bi-plot, the proportion of black residents (black), nitric oxide concentration (nox), industrialization (indus), tax rate (tax) and crime rate (crim) were the biggest factors separating the suburbs into four clusters.

A higher crime rate pulled suburbs into cluster one. A lower (or higher? the direction is unclear from the variable description and its formula) proportion of black residents pulled towns in the opposite direction. Nitric oxide concentration, industrialization and tax rate pulled suburbs towards the left, i.e. clusters 1 and 2. These zones were also more accessible to radial highways.

Super bonus

We predicted crime rate category using other variables and plotted it in 3D!

The individual points were colored in the first graph according to the crime rate category, and in the second graph according to k-means clustering (the data were clustered into four groups to match the four crime rate categories).

Even though in the second 3D graph the data were clustered using all available variables rather than specifically to predict crime, the clustering still does a pretty good job of separating the suburbs into four groups that match the crime rate categories.
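The quantile binning used below can be sketched on a toy vector (an illustrative example, not the train set): cut() with quantile() breaks turns a numeric variable into four ordered categories of roughly equal size.

```r
# cut() with quantile() breaks: min, 25%, 50%, 75%, max define four bins
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
bins <- quantile(x)
cats <- cut(x, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)  # two observations fall in each category
```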

# Plotting a 3D graph for the train set, colored by crime category

data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]

km <- kmeans(train, centers = 4)

bins <- quantile(train$crim)
label <- c("low", "med_low", "med_high", "high")
crime <- cut(train$crim, breaks = bins, include.lowest = TRUE, label = label)
train <- data.frame(train, crime)

lda.fit <- lda(crime ~ ., data = train)

classes <- as.numeric(train$crime)

model_predictors <- dplyr::select(train, -crime)

matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)